Investigate Titanic's Data

Questions to ask ourselves

What factors made people more likely to survive?

  • Sex
  • Class
  • Age
  • How much they paid

In [3]:
#imports
import pandas as pd
import numpy as np

In [4]:
raw_data = pd.read_csv('titanic_data.csv')

In [26]:
raw_data.head()


Out[26]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S

Data Wrangling

We need to find out how many null values our data has.

The describe function might be useful:


In [9]:
raw_data.describe()


Out[9]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

However, describe only summarises numeric columns, so this way we cannot see missing values in the non-numeric columns.

We move to another option:


In [12]:
raw_data.isnull().sum()


Out[12]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

How do we treat nulls

In Age

Out of 891 rows, 177 are NaN, which is roughly 20%. If we replace these NaN with some other value, it should be a guard value that cannot be mistaken for a real age, so it does not affect the rest of the values.

In Cabin

Out of 891 rows, 687 are nulls, representing an astounding 77%. Ignoring this column altogether makes more sense.

In Embarked

With only 2 NaN in this column, we could simply ignore these rows. We could also assign another value and see how they behave.
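The percentages above can be computed directly instead of by hand. The following is a minimal sketch on a toy frame that is made up for illustration (it does not contain rows from titanic_data.csv); on the real data the same expression would be applied to raw_data.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration only (not taken from titanic_data.csv).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", "S", None, "S"],
})

# isnull() yields a boolean frame; the mean of booleans is the share of True
# values, so multiplying by 100 gives the percentage of nulls per column.
null_share = toy.isnull().mean().mul(100).round(1)
print(null_share)
```

Seeing the share next to the raw count makes it easier to decide between filling, dropping the column, or dropping rows.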

Code

Age


In [6]:
clean_data = raw_data.copy()
clean_data['Age'] = clean_data['Age'].fillna(-1)
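One consequence of the -1 guard value is that any numeric summary of Age has to leave it out, otherwise it is counted as a real age. A small sketch with made-up ages (the values below are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Toy ages, invented for illustration; NaN marks a missing age.
ages = pd.Series([22.0, np.nan, 38.0, np.nan, 30.0], name="Age")
filled = ages.fillna(-1)

# Treating the guard value as a real age drags the mean down.
naive_mean = filled.mean()      # (22 - 1 + 38 - 1 + 30) / 5

# Filtering the guard value out first recovers the mean of the known ages.
known = filled[filled >= 0]
known_mean = known.mean()       # (22 + 38 + 30) / 3

print(naive_mean, known_mean)
```

The same filter is useful before plotting by age, since -1 otherwise shows up as its own bar.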

Cabin


In [7]:
clean_data.drop('Cabin', axis=1, inplace=True)

Embarked

Before deleting anything, let's check the rows


In [27]:
raw_data[raw_data['Embarked'].isnull()]


Out[27]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
61 62 1 1 Icard, Miss. Amelie female 38 0 0 113572 80 B28 NaN
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62 0 0 113572 80 B28 NaN

It looks a bit strange: both survived, they share the same cabin and the same ticket, and both lack Embarked information.

Instead of deleting them we will leave the rows for now.

These are configuration options for the charts.


In [5]:
%pylab inline
figsize(47,20)


Populating the interactive namespace from numpy and matplotlib

Data Exploration

We want to see the data depicted in these ways:

  • How many people survived?
  • Survival by age
  • Survival by sex
  • Survival by age and sex
  • Survival by age and class
  • Survival by sex and class

This should show us where the survival rates are concentrated.

How many people survived?

As a first exploration step, we are interested in how many people survived.


In [79]:
import matplotlib.pyplot as plt
# Count passengers in each Survived group (0 = died, 1 = survived)
survivors = clean_data.groupby('Survived').count()['Name']

plt.figure(figsize=(18,8))
colors = ['grey','cyan']
plt.pie(survivors, labels=['Died','Survived'], explode=[0,0.05], autopct='%1.1f%%', colors = colors)

plt.axis("equal")
plt.title("Titanic Survivors")
plt.show()


Survival by Age

Code


In [41]:
clean_data[clean_data['Survived'] == 1].groupby('Age').count().reset_index().plot(kind='bar',y='PassengerId', x='Age')
#pd.pivot_table(clean_data[clean_data['Survived'] == 1], index='Age', aggfunc=np.count_nonzero


Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x10ff86310>

But this is not very helpful, since we don't see how many people there were in each age group. We can either plot both survivors and non-survivors, or calculate a survival ratio by age.

Let's see which helps us more.


In [84]:
#clean_data.groupby(['Age','Survived']).count().reset_index().plot(kind='bar',stacked = True, y='PassengerId', x='Age')
pivot_age = pd.pivot_table(clean_data, values='PassengerId', index='Age', columns='Survived', aggfunc=np.count_nonzero)
pivot_age.fillna(0).plot(kind='bar', stacked=True)


Out[84]:
<matplotlib.axes._subplots.AxesSubplot at 0x1178e7f90>

From what we can see, not much information can be gained from age, but let's analyse by ratio, to be certain about that.


In [94]:
pivot_age = pivot_age.fillna(0)
pivot_age['survival_ratio'] = pivot_age[1] / (pivot_age[0] + pivot_age[1])
pivot_age.plot(kind = 'bar', y='survival_ratio')


Out[94]:
<matplotlib.axes._subplots.AxesSubplot at 0x12274c4d0>

From this plot we can see that the highest survival ratios are for ages up to 9 and between 11 and 14. Some other age ranges also show good survival rates, such as 47 to 55.
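Since the conclusion is stated in age ranges, one alternative (not used in this notebook) is to group ages into bands before computing the ratio. The sketch below uses pd.cut on a toy frame; both the rows and the band edges are assumptions chosen for illustration.

```python
import pandas as pd

# Toy rows, invented for illustration; -1 stands for an unknown age.
toy = pd.DataFrame({
    "Age": [2, 5, 8, 15, 22, 30, 41, 50, 63, -1],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0, 0, 1],
})

# Leave out the guard value so it does not form a band of its own.
known = toy[toy["Age"] >= 0]

# Hypothetical band edges; right=False makes each band [low, high).
bins = [0, 10, 20, 40, 60, 90]
age_band = pd.cut(known["Age"], bins=bins, right=False)

# mean of Survived is the survival ratio, count is the size of each band.
by_band = known.groupby(age_band, observed=False)["Survived"].agg(["mean", "count"])
print(by_band)
```

Keeping the count next to the ratio shows how many passengers each ratio rests on, which matters when a single age has only one or two people.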

Survival by Sex

Let's see which sex survived more.

Code


In [99]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=1, ncols=2)
survivors_male = clean_data[clean_data['Sex']=='male'].groupby('Survived').count()['Name']
survivors_female = clean_data[clean_data['Sex']=='female'].groupby('Survived').count()['Name']

colors = ['grey','cyan']

male_plot = survivors_male.plot(kind='pie', labels=['Died','Survived'], explode=[0,0.05], autopct='%1.1f%%', colors = colors, ax=axes[0])
male_plot.axis("equal")
male_plot.set_title("Male Titanic Survivors")

female_plot = survivors_female.plot(kind='pie', labels=['Died','Survived'], explode=[0,0.05], autopct='%1.1f%%', colors = colors, ax=axes[1])
female_plot.axis("equal")
female_plot.set_title("Female Titanic Survivors")


Out[99]:
<matplotlib.text.Text at 0x12b090510>
<matplotlib.figure.Figure at 0x1227e4210>

As this representation clearly shows, a large share of females survived: around 74%.

With this information alone we could already make a fairly good prediction.
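To put a number on that, we can score the simplest possible rule: predict that every female survived and every male died. The sketch below is built from the died/survived counts shown in the class-by-sex pivot output in this notebook; the column names died and survived are our own labels.

```python
import pandas as pd

# Counts copied from the class-by-sex pivot output in this notebook.
counts = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Sex": ["female", "male"] * 3,
    "died": [3, 77, 6, 91, 72, 300],
    "survived": [91, 45, 70, 17, 72, 47],
})

by_sex = counts.groupby("Sex")[["died", "survived"]].sum()

# Rule: predict "survived" for every female and "died" for every male.
correct = by_sex.loc["female", "survived"] + by_sex.loc["male", "died"]
total = by_sex.to_numpy().sum()
accuracy = correct / total

print(f"{correct} of {total} correct ({accuracy:.1%})")  # 701 of 891 correct (78.7%)
```

This is only accuracy on the same passengers the rule was read from, so it says nothing about how the rule would do on unseen data.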

Survival by age and sex

A set of visualizations might help us see whether the highest survival ratios for males are concentrated in one particular age range. It also looks worthwhile to check the death ratio by age for females, to see which ages fared worst.


In [103]:
survivors_male_age_pivot = clean_data[clean_data['Sex']=='male'].pivot_table(index='Age', columns='Survived', aggfunc=np.count_nonzero)
survivors_male_age_pivot = survivors_male_age_pivot.fillna(0)['PassengerId']
survivors_male_age_pivot['survival_ratio'] = survivors_male_age_pivot[1]/(survivors_male_age_pivot[1]+survivors_male_age_pivot[0])
survivors_male_age_pivot.plot(kind='bar', y='survival_ratio')


Out[103]:
<matplotlib.axes._subplots.AxesSubplot at 0x11d2a3550>

With this representation we can clearly see that males aged 0 to 6 are the ones who survived the most.

For females we want to study which ages died the most, since many more women survived.


In [104]:
survivors_female_age_pivot = clean_data[clean_data['Sex']=='female'].pivot_table(index='Age', columns='Survived', aggfunc=np.count_nonzero)
survivors_female_age_pivot = survivors_female_age_pivot.fillna(0)['PassengerId']
survivors_female_age_pivot['dead_ratio'] = survivors_female_age_pivot[0]/(survivors_female_age_pivot[1]+survivors_female_age_pivot[0])
survivors_female_age_pivot.plot(kind='bar', y='dead_ratio')


Out[104]:
<matplotlib.axes._subplots.AxesSubplot at 0x10c9ecb90>

We would have expected something clearer, but this doesn't help us. There is no conclusion that we can draw from this plot.

Survival by age and class

First we need to explore the different values we have in class.


In [108]:
clean_data['Pclass'].head()


Out[108]:
0    3
1    1
2    3
3    1
4    3
Name: Pclass, dtype: int64

We see the data takes values from 1 to 3, standing for 1st class (richer) to 3rd class (poorer).


In [114]:
# Build a per-age table of died/survived counts for the rows where
# `attribute` equals `value`, plus the share of survivors at each age.
def get_survival_ratio_pivot(source, attribute, value):
    pivot = source[source[attribute]==value].pivot_table(index='Age', columns='Survived', aggfunc=np.count_nonzero)
    pivot = pivot.fillna(0)['PassengerId']
    pivot['survival_ratio'] = pivot[1]/(pivot[1]+pivot[0])
    return pivot

In [109]:
survivors_first_age_pivot = get_survival_ratio_pivot(clean_data,'Pclass', 1)
survivors_first_age_pivot.plot(kind='bar', y='survival_ratio')


Out[109]:
<matplotlib.axes._subplots.AxesSubplot at 0x10cf86890>
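Because Survived is coded as 0 and 1, the survival_ratio that get_survival_ratio_pivot computes should match the mean of Survived within each age group. A shorter sketch of the same idea follows, on a toy frame invented for illustration (it is not the notebook's method, just an equivalent shortcut under that assumption).

```python
import pandas as pd

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3],
    "Age": [30, 30, 40, 30, 30, 30],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Keep one class, then average the 0/1 Survived flag per age:
# the mean of a 0/1 column is survivors / (survivors + deaths).
first = toy[toy["Pclass"] == 1]
survival_ratio = first.groupby("Age")["Survived"].mean()
print(survival_ratio)
```

The pivot version is still useful when the died and survived counts themselves are needed, for example for a stacked bar chart.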

In [115]:
survivors_second_age_pivot = get_survival_ratio_pivot(clean_data,'Pclass', 2)
survivors_second_age_pivot.plot(kind='bar', y='survival_ratio')


Out[115]:
<matplotlib.axes._subplots.AxesSubplot at 0x11d5a8c90>

This distribution is more revealing. In second class, the highest survival ratios are among the very young. At this point it would be helpful to know how many people this represents.


In [126]:
survivors_second_age_pivot.columns = ['Died', 'Survived', 'Ratio']
# Select the renamed count columns by label rather than by position
ssap_plot = survivors_second_age_pivot.plot(kind='bar',stacked = True, y=['Died','Survived'])
#ssap_plot.set_label(['Died','Survived'])



In [127]:
survivors_third_age_pivot = get_survival_ratio_pivot(clean_data,'Pclass', 3)
survivors_third_age_pivot.plot(kind='bar', y='survival_ratio')


Out[127]:
<matplotlib.axes._subplots.AxesSubplot at 0x13aabf490>

This distribution shows that just by being in 3rd class, your chances of surviving were a lot lower. Let's calculate how much lower.


In [130]:
survived_by_class = clean_data.pivot_table(index='Pclass', columns='Survived', aggfunc=np.count_nonzero)['PassengerId']
survived_by_class['ratio'] = survived_by_class[1]/(survived_by_class[1]+survived_by_class[0])
survived_by_class


Out[130]:
Survived    0    1     ratio
Pclass
1          80  136  0.629630
2          97   87  0.472826
3         372  119  0.242363

The trend is clear: the lower the class (and so, less money), the lower the chance of survival.
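To express how much lower, we can divide each class's survival ratio by the 3rd class ratio. The sketch below rebuilds the table from the counts in Out[130]; the column names died and survived are our own labels.

```python
import pandas as pd

# Counts copied from Out[130] in this notebook.
survived_by_class = pd.DataFrame(
    {"died": [80, 97, 372], "survived": [136, 87, 119]},
    index=pd.Index([1, 2, 3], name="Pclass"),
)

# Survival ratio per class: survivors divided by all passengers in the class.
rate = survived_by_class["survived"] / survived_by_class.sum(axis=1)

# How many times more likely each class was to survive than 3rd class.
relative_to_third = rate / rate.loc[3]
print(relative_to_third.round(2))
```

With these counts, 1st class passengers survived at roughly 2.6 times the 3rd class rate, and 2nd class passengers at roughly twice that rate.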

Survival by sex and class

Let's build a pivot table that represents this information as clearly as possible.


In [11]:
from pivottablejs import pivot_ui
pivot_ui(clean_data)


Out[11]:

With the help of this tool we see that the clearest layout is:


In [15]:
class_gender_pivot = pd.pivot_table(clean_data, index=['Pclass','Sex'],columns='Survived', aggfunc=np.count_nonzero)['PassengerId']
class_gender_pivot['survival_ratio'] = class_gender_pivot[1]/(class_gender_pivot[1]+class_gender_pivot[0])
class_gender_pivot


Out[15]:
Survived          0   1  survival_ratio
Pclass Sex
1      female     3  91        0.968085
       male      77  45        0.368852
2      female     6  70        0.921053
       male      91  17        0.157407
3      female    72  72        0.500000
       male     300  47        0.135447

With this information we can say that a higher class meant better odds of survival, especially for men, whose chances more than doubled in first class. Almost all women in first and second class survived, while women in third class had exactly a 50% chance of surviving.
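The "more than doubled" claim for men can be checked by laying the ratios out with one column per sex. The sketch below rebuilds the pivot from the counts in Out[15]; the names wide and male_gain are our own.

```python
import pandas as pd

# Counts copied from Out[15] in this notebook (columns 0 = died, 1 = survived).
index = pd.MultiIndex.from_product(
    [[1, 2, 3], ["female", "male"]], names=["Pclass", "Sex"]
)
pivot = pd.DataFrame(
    {0: [3, 77, 6, 91, 72, 300], 1: [91, 45, 70, 17, 72, 47]}, index=index
)

ratio = pivot[1] / (pivot[0] + pivot[1])

# Move Sex from the row index to the columns: one row per class.
wide = ratio.unstack("Sex")

# First-class male survival ratio relative to 2nd and 3rd class males.
male_gain = wide.loc[1, "male"] / wide.loc[[2, 3], "male"]
print(wide.round(3))
print(male_gain.round(2))
```

With these counts, first-class men survived at about 2.3 times the second-class rate and about 2.7 times the third-class rate.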

Conclusions

After analysing the data, we can state that:

  • Females were more likely to survive than males.
  • Upper classes had higher survival ratios. 1st class had the best survival ratio for men, while 1st and 2nd had the best survival ratios for women.
  • Age was a factor, but its effect is difficult to pinpoint precisely.